ABSTRACT


A formal framework for distribution-free concept learning, known as Valiant's learning
framework, has generated a great deal of interest. A fundamental result regarding
this framework characterizes those concept classes which are learnable in terms of
their Vapnik-Chervonenkis (VC) dimension. More recently, learnability with respect
to a fixed probability distribution (a variant of the original distribution-free
framework) has been studied, and an analogous result characterizing learnability in this
case was shown. A conjecture regarding learnability for a class of distributions
was also stated.


In  this  report,  we  first  point  out  that  the  condition  for  learnability  for  a  fixed 
distribution  is  equivalent  to  the  notion  of  finite  metric  entropy  (which  has  been 
studied  in  other  contexts).  Some  relationships  between  the  VC  dimension  of  a 
concept  class  and  its  metric  entropy  with  respect  to  various  distributions  are  then 
discussed.  Finally,  we  prove  some  partial  results  regarding  learnability  for  a  class 
of  distributions,  which  provide  some  indication  of  when  the  set  of  learnable  concept 
classes is enlarged by requiring learnability for only a class of distributions.
1. INTRODUCTION


In  [23],  Valiant  proposed  a  precise  framework  to  capture  the  notion  of  what  we  mean  by 
“learning  from  examples.”  The  essential  idea  consists  of  approximating  an  unknown  “concept” 
from  a  finite  number  of  positive  and  negative  “examples”  of  the  concept.  For  example,  the  concept 
might  be  some  unknown  geometric  figure  in  the  plane,  and  the  positive  and  negative  examples  are 
points  inside  and  outside  the  figure,  respectively.  The  goal  is  to  approximate  the  figure  from  a 
finite  number  of  such  points.  The  examples  are  assumed  to  be  drawn  according  to  some  probability 
distribution.  The  same  distribution  is  used  to  evaluate  how  well  a  concept  is  learned.  However,  no 
assumptions  are  made  about  which  particular  distribution  is  used.  That  is,  learning  is  required  to 
take  place  for  every  distribution. 

Valiant’s  seminal  paper  [23]  has  led  to  a  large  amount  of  work  analyzing  and  extending  the 
formal  learning  framework  which  was  originally  proposed.  A  fundamental  paper  was  written  by 
Blumer  et  al.  [6]  which  gave  a  characterization  of  learnability  for  the  distribution-free  framework 
in  terms  of  a  combinatorial  parameter  which  measures  the  “size”  of  a  concept  class.  Benedek  and 
Itai  [4]  studied  a  variation  of  Valiant’s  learning  framework  in  which  the  examples  are  assumed  to 
be  drawn  from  a  fixed  and  known  distribution.  In  this  case,  a  characterization  of  learnability  was 
given  in  terms  of  a  different  measure  of  the  size  of  a  concept  class. 

In  Section  2,  we  give  some  definitions,  a  precise  description  of  the  learning  framework,  and 
some  previous  results  from  [6]  and  [4].  The  definitions  and  notation  used  are  essentially  those  from 
[6],  which  are  a  slight  variation  of  those  originally  given  in  [23].  The  major  result  of  [6]  states 
that  a  concept  class  is  learnable  for  every  distribution  iff  it  has  finite  Vapnik-Chervonenkis  (VC) 
dimension.  An  analogous  result  of  [4]  characterizes  learnability  for  a  fixed  distribution.  We  point  out 
that  this  characterization  is  identical  to  that  of  finite  metric  entropy,  which  has  been  studied  in  other 
contexts.  The  results  characterizing  learnability  suggest  that  there  may  be  relationships  between 
the  VC  dimension  of  a  concept  class  and  its  metric  entropy  with  respect  to  various  distributions. 
Some  such  relationships,  in  addition  to  those  investigated  in  [4],  are  discussed  in  Section  3.  We  state 
an  earlier  result  from  [8]  and  prove  a  new  result,  both  of  which  offer  some  improvements  on  different 
results of [4]. In Section 4, we consider learnability for a class of distributions, which is a natural
extension  of  learnability  for  a  fixed  distribution.  Benedek  and  Itai  [4]  posed  the  characterization  of 
learnability  in  this  case  as  an  open  problem.  They  conjectured  that  a  concept  class  is  learnable  with 
respect  to  a  class  of  distributions  iff  the  metric  entropy  of  the  concept  class  with  respect  to  each 
distribution  is  uniformly  bounded  over  the  class  of  distributions.  We  prove  some  partial  results 
for  this  problem.  Although  the  results  we  prove  are  far  from  verifying  the  conjecture  in  general, 
they are consistent with it. Furthermore, they provide some indication of conditions when power
is gained by requiring learnability only for a class of distributions rather than for all distributions.
Finally,  in  Section  5,  we  briefly  summarize  and  mention  some  related  work  that  has  been  done  on 
Valiant’s  learning  framework. 
2. DEFINITIONS AND PREVIOUS RESULTS CHARACTERIZING
LEARNABILITY


In  this  section,  we  describe  the  formal  model  of  learning  introduced  by  Valiant  [23]  (learnability 
for  all  distributions)  and  a  variant  (learnability  for  a  fixed  distribution).  We  also  state  previous 
results  characterizing  learnability  in  these  cases.  The  result  of  Blumer  et  al.  [6]  characterizes 
learnability  for  all  distributions  in  terms  of  a  quantity  known  as  the  VC  dimension.  The  result  of 
Benedek  and  Itai  [4]  characterizes  learnability  for  a  fixed  distribution  in  terms  of  a  quantity  that 
is  essentially  metric  entropy. 

Informally,  Valiant’s  learning  framework  can  be  described  as  follows.  The  learner  wishes  to 
learn  a  concept  unknown  to  him.  The  teacher  provides  the  learner  with  random  positive  and 
negative examples of the concept drawn according to some probability distribution. From a finite
set  of  examples,  the  learner  outputs  a  hypothesis  which  is  his  current  estimate  of  the  concept.  The 
error  of  the  estimate  is  taken  as  the  probability  that  the  hypothesis  will  incorrectly  classify  the  next 
randomly  chosen  example.  The  learner  cannot  be  expected  to  exactly  identify  the  concept  since 
only  a  finite  number  of  examples  are  seen.  Also,  since  the  examples  are  randomly  chosen,  there  is 
some  chance  that  the  hypothesis  will  be  very  far  off  (due  to  poor  examples).  Hence,  the  learner 
is  only  required  to  closely  approximate  the  concept  with  sufficiently  high  probability  from  some 
finite  number  of  examples.  Furthermore,  the  number  of  examples  required  for  a  given  accuracy  and 
confidence  should  be  independent  of  the  distribution  from  which  the  examples  are  drawn.  Below, 
we  will  describe  this  framework  precisely,  following  closely  the  notation  of  [6]. 

Let $X$ be a set which is assumed to be fixed and known. $X$ is sometimes called the instance
space. Typically, $X$ is taken to be either $R^n$ (especially $R^2$) or the set of binary $n$-vectors. A concept
will refer to a subset of $X$, and a collection of concepts $C \subseteq 2^X$ will be called a concept class. An
element $x \in X$ will be called a sample, and a pair $(x, a)$ with $x \in X$ and $a \in \{0, 1\}$ will be called a
labeled sample. Likewise, $x = (x_1, \ldots, x_m) \in X^m$ is called an m-sample, and a labeled m-sample is an
m-tuple $((x_1, a_1), \ldots, (x_m, a_m))$ where $a_i = a_j$ if $x_i = x_j$. For $x = (x_1, \ldots, x_m) \in X^m$ and $c \in C$,
the labeled m-sample of $c$ generated by $x$ is given by $\mathrm{sam}_c(x) = ((x_1, I_c(x_1)), \ldots, (x_m, I_c(x_m)))$
where $I_c(\cdot)$ is the indicator function for the set $c$. The sample space of $C$ is denoted by $S_C$ and
consists of all labeled m-samples for all $c \in C$, all $x \in X^m$, and all $m \ge 1$.

Let $H$ be a collection of subsets of $X$. $H$ is called the hypothesis class, and the elements
of $H$ are called hypotheses. Let $F_{C,H}$ be the set of all functions $f : S_C \to H$. A function $f \in F_{C,H}$
is called consistent if it always produces a hypothesis which agrees with the samples, i.e.,
whenever $h = f((x_1, a_1), \ldots, (x_m, a_m))$ we have $I_h(x_i) = a_i$ for $i = 1, \ldots, m$. Given a probability
distribution $P$ on $X$, the error of $f$ with respect to $P$ for a concept $c \in C$ and sample $x$ is defined
as $\mathrm{error}_{f,c,P}(x) = P(c \Delta h)$ where $h = f(\mathrm{sam}_c(x))$ and $c \Delta h$ denotes the symmetric difference of the
sets $c$ and $h$. Finally, in the definition of learnability to be given below, the samples used in forming
a hypothesis will be drawn from $X$ independently according to the same probability measure $P$.
Hence, an m-sample will be drawn from $X^m$ according to the product measure $P^m$.
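
As an illustration only, the objects just defined can be sketched for a finite instance space. The helper names `sam`, `is_consistent`, and `error`, the four-point space, and the uniform distribution are assumptions made for this sketch and do not come from [6].

```python
def sam(c, xs):
    """Labeled m-sample of concept c generated by the m-sample xs."""
    return tuple((x, int(x in c)) for x in xs)

def is_consistent(h, labeled):
    """True if hypothesis h agrees with every labeled sample."""
    return all(int(x in h) == a for x, a in labeled)

def error(P, c, h):
    """P(c symmetric-difference h) for a discrete P given as {point: probability}."""
    return sum(p for x, p in P.items() if (x in c) != (x in h))

# Toy instance space X = {0, 1, 2, 3} with the uniform distribution (assumed).
P = {x: 0.25 for x in range(4)}
c = frozenset({0, 1})        # target concept
h = frozenset({0, 1, 3})     # a candidate hypothesis
labeled = sam(c, (0, 2, 2))  # repeated points receive the same label
```

Here `h` agrees with all three labeled samples yet differs from `c` on the point 3, so consistency on a sample does not by itself imply zero error.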




We can now state the following definition of learnability for every distribution, which is the
version from Blumer et al. [6] of Valiant's [23] original definition (without restrictions on computational
complexity; see below).


Definition 1 (Learnability for Every Distribution) The pair $(C, H)$ is learnable if there
exists a function $f \in F_{C,H}$ such that for every $\epsilon, \delta > 0$ there is a $0 < m < \infty$ such that for every
probability measure $P$ and every $c \in C$, if $x \in X^m$ is chosen at random according to $P^m$ then the
probability that $\mathrm{error}_{f,c,P}(x) < \epsilon$ is greater than $1 - \delta$.

Several  comments  concerning  this  definition  are  in  order.  First,  learnability  depends  on  both 
the  concept  class  C  and  the  hypothesis  class  H,  which  is  why  we  defined  learnability  in  terms  of 
the pair $(C, H)$. However, in the literature the case $H \supseteq C$ is often considered, in which case, for
convenience, we may speak of learnability of $C$ in place of $(C, C)$. Second, the sample size $m$ is
clearly a function of $\epsilon$ and $\delta$, but a fixed $m = m(\epsilon, \delta)$ must work uniformly for every distribution
$P$ and concept $c \in C$. Because of this, the term distribution-free learning is often used to describe
this learning framework. Finally, $\epsilon$ can be thought of as an accuracy parameter while $\delta$ can be
thought of as a confidence parameter. The definition requires that the learning algorithm $f$ output
a hypothesis that with high probability (greater than $1 - \delta$) is approximately correct (to within $\epsilon$).
Angluin  and  Laird  [2]  used  the  term  probably  approximately  correct  (PAC)  learning  to  describe  this 
definition. 

A  somewhat  more  general  and  useful  definition  of  learnability  was  actually  used  by  Valiant 
in  [23]  and  later  by  others.  This  definition  incorporates  both  a  notion  of  the  size  or  complexity 
of  concepts  and  the  central  idea  that  the  learning  algorithm  (i.e.,  the  function  which  produces  a 
hypothesis  from  labeled  samples)  should  have  polynomial  complexity  in  the  various  parameters. 
Other  variations  of  this  definition,  such  as  seeing  positive  examples  only,  or  having  the  choice  of 
positive  or  negative  examples,  have  also  been  considered.  Some  equivalences  among  the  various 
learnability  definitions  were  shown  in  [10].  In  this  report,  we  will  not  consider  these  variations. 
Also, we will be considering the case that $H \supseteq C$ throughout, so that we will simply speak of the
learnability of $C$ rather than learnability of $(C, H)$.

A  fundamental  result  of  Blumer  et  al.  [6]  relates  learnability  for  every  distribution  to  the 
Vapnik-Chervonenkis (VC) dimension of the concept class to be learned. The notion of VC dimension
was introduced in [25] and has been studied and used in [8,26,11]. Many interesting concept
classes  have  been  shown  to  have  finite  VC  dimension. 
Definition 2 (Vapnik-Chervonenkis Dimension) Let $C \subseteq 2^X$. For any finite set $S \subseteq X$, let
$\Pi_C(S) = \{S \cap c : c \in C\}$. $S$ is said to be shattered by $C$ if $\Pi_C(S) = 2^S$. The Vapnik-Chervonenkis
dimension of $C$ is defined to be the largest integer $d$ for which there exists a set $S \subseteq X$ of cardinality
$d$ such that $S$ is shattered by $C$. If no such largest integer exists then the VC dimension of $C$ is
infinite.
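
For a finite instance space the definition can be checked by brute force. The following is a minimal sketch under that assumption; the function names and the two example classes (discrete intervals and initial segments on six points) are chosen here for illustration.

```python
from itertools import combinations

def restrictions(C, S):
    """Pi_C(S): the distinct subsets of S cut out by concepts in C."""
    S = frozenset(S)
    return {S & c for c in C}

def is_shattered(C, S):
    """S is shattered by C iff every subset of S appears as a restriction."""
    return len(restrictions(C, S)) == 2 ** len(S)

def vc_dimension(C, X):
    """Largest k such that some k-subset of the finite set X is shattered."""
    d = 0
    for k in range(1, len(X) + 1):
        if any(is_shattered(C, S) for S in combinations(X, k)):
            d = k
        else:
            break  # if no k-set is shattered, no larger set can be
    return d

X = list(range(6))
# Discrete intervals {a, ..., b-1}, including the empty set.
intervals = {frozenset(range(a, b)) for a in range(7) for b in range(a, 7)}
# Initial segments {0, ..., t-1}.
segments = {frozenset(range(t)) for t in range(7)}
```

Intervals shatter some pair of points but no triple (the outer two points cannot be selected without the middle one), while initial segments shatter only single points.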

A  concept  class  C  will  be  called  trivial  if  C  contains  only  one  concept  or  two  disjoint  concepts. 
In  [6],  a  definition  was  also  given  for  what  they  called  a  well-behaved  concept  class,  which  involves 
the  measurability  of  certain  sets  used  in  the  proof  of  their  theorem.  We  will  not  concern  ourselves 
with  the  definition  here.  The  following  theorem  is  stated  exactly  from  [6]  and  was  their  main  result. 

Theorem  1  For  any  nontrivial,  well-behaved  concept  class  C,  the  following  are  equivalent: 

(i)  The  VC  dimension  of  C  is  finite. 

(ii) $C$ is learnable.

(iii) If $d$ is the VC dimension of $C$ then

(a) for sample size greater than max(| log |, ^ l°g ~r), any
consistent function $f \in F_{C,H}$ is a learning algorithm for
$C$, and

(b) for $\epsilon$ < ^ and sample size less than max(^ log d(1 - 2($\epsilon + \delta - \epsilon\delta$))),
no function $f \in F_{C,H}$ where $C \subseteq H$ is a learning
algorithm for $C$.

A  definition  of  learnability  similar  to  that  of  Definition  1  can  be  given  for  the  case  of  a  single, 
fixed,  and  known  probability  measure. 

Definition 3 (Learnability for a Fixed Distribution) Let $P$ be a fixed and known probability
measure. The pair $(C, H)$ is said to be learnable with respect to $P$ if there exists a function $f \in F_{C,H}$
such that for every $\epsilon, \delta > 0$ there is a $0 < m < \infty$ such that for every $c \in C$, if $x \in X^m$ is chosen
at random according to $P^m$ then the probability that $\mathrm{error}_{f,c,P}(x) < \epsilon$ is greater than $1 - \delta$.

Conditions  for  learnability  in  this  case  were  studied  by  Benedek  and  Itai  [4].  They  introduced 
the  notion  of  what  they  called  a  “finite  cover”  for  a  concept  class  with  respect  to  a  distribution 
and  were  able  to  show  that  finite  coverability  characterizes  learnability  for  a  fixed  distribution. 
It  turns  out  that  their  definition  of  finite  coverability  is  identical  to  the  notion  of  metric  entropy, 
which  has  been  studied  in  other  literature.  Specifically,  the  measure  of  error  between  two  concepts 
with  respect  to  a  distribution  is  a  semi-metric  (or  pseudo-metric).  The  notion  of  finite  coverability 
is identical to the notion of finite metric entropy with respect to the semi-metric induced by the
distribution  P. 

We define metric entropy below, but first show that $P$ induces a semi-metric on the concept
class. Define $d_P(c_1, c_2) = P(c_1 \Delta c_2)$ for $c_1, c_2 \subseteq X$ measurable with respect to $P$. For $c_1, c_2 \in C$,
$d_P(c_1, c_2)$ just represents the error between $c_1$ and $c_2$ that has been used throughout. In the following
proposition we prove that $d_P(\cdot, \cdot)$ defines a semi-metric on the set of all subsets of $X$ measurable
with respect to $P$, and hence defines a semi-metric on the concept class $C$.

Proposition 1 For any probability measure $P$, $d_P(c_1, c_2) = P(c_1 \Delta c_2)$ is a semi-metric on the
$\sigma$-algebra $S$ of subsets of $X$ measurable with respect to $P$. I.e., for all $c_1, c_2, c_3 \in S$

(i) $d_P(c_1, c_2) \ge 0$

(ii) $d_P(c_1, c_2) = d_P(c_2, c_1)$

(iii) $d_P(c_1, c_3) \le d_P(c_1, c_2) + d_P(c_2, c_3)$

Proof: (i) is true since $P$ is a probability measure, (ii) is true since $c_1 \Delta c_2 = c_2 \Delta c_1$, and (iii) follows
from subadditivity and the fact that $c_1 \Delta c_3 \subseteq (c_1 \Delta c_2) \cup (c_2 \Delta c_3)$. ∎

Note that $d_P(\cdot, \cdot)$ is only a semi-metric since it does not usually satisfy the requirement of a
metric that $d_P(c_1, c_2) = 0$ iff $c_1 = c_2$. That is, $c_1$ and $c_2$ may be unequal but may differ on a set of
measure zero with respect to $P$, so that $d_P(c_1, c_2) = 0$.

We  now  define  metric  entropy. 

Definition 4 (Metric Entropy) Let $(Y, \rho)$ be a metric space. Define $N(\epsilon) = N(\epsilon, Y, \rho)$ to be the
smallest integer $n$ such that there exist $y_1, \ldots, y_n \in Y$ with $Y = \bigcup_{i=1}^{n} B_\epsilon(y_i)$ where $B_\epsilon(y_i)$ is the
open ball of radius $\epsilon$ centered at $y_i$. If no such $n$ exists, then $N(\epsilon, Y, \rho) = \infty$. The metric entropy
of $Y$ (often called the $\epsilon$-entropy) is defined to be $\log_2 N(\epsilon)$.

$N(\epsilon)$ represents the smallest number of balls of radius $\epsilon$ which are required to cover $Y$. For
another interpretation, suppose we wish to approximate $Y$ by a finite set of points so that every
element of $Y$ is within $\epsilon$ of at least one member of the finite set. Then $N(\epsilon)$ is the smallest number
of points possible in such a finite approximation of $Y$. The notion of metric entropy for various
metric  spaces  has  been  studied  and  used  by  a  number  of  authors  (e.g.,  see  [8,9,12,16,17,22]). 
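
To make the covering interpretation concrete, the following sketch computes $N(\epsilon)$ exactly by exhaustive search for a small finite concept class under the semi-metric $P(c_1 \Delta c_2)$. The three-point instance space, the uniform distribution, and the function names are assumptions for illustration; the search is exponential and only meant for tiny examples.

```python
from fractions import Fraction
from itertools import combinations
from math import log2

def d_P(P, c1, c2):
    """Semi-metric P(c1 symmetric-difference c2) for a discrete distribution P."""
    return sum(p for x, p in P.items() if (x in c1) != (x in c2))

def covering_number(C, P, eps):
    """Smallest number of open eps-balls, centered at members of C, covering C."""
    C = list(C)
    for n in range(1, len(C) + 1):
        for centers in combinations(C, n):
            if all(any(d_P(P, c, y) < eps for y in centers) for c in C):
                return n
    return None  # not reached for eps > 0, since C covers itself

X = (0, 1, 2)
P = {x: Fraction(1, 3) for x in X}  # uniform distribution on three points
C = [frozenset(s) for k in range(4) for s in combinations(X, k)]  # all 8 subsets

n_small = covering_number(C, P, Fraction(1, 6))  # balls contain only their center
n_large = covering_number(C, P, Fraction(1, 2))  # balls reach Hamming distance 1
```

With radius $1/6$ every concept needs its own ball, giving $N = 8$ and metric entropy $\log_2 8 = 3$; with radius $1/2$ the empty set and the full set suffice as centers, giving $N = 2$.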

The notion of metric entropy can still be used even if $\rho$ is only a semi-metric rather than
a metric. For convenience, if $P$ is a distribution we will use the notation $N(\epsilon, C, P)$ (instead of
$N(\epsilon, C, d_P)$), and we will speak of the metric entropy of $C$ with respect to $P$, with the understanding
that the semi-metric being used is $d_P(\cdot, \cdot)$. Benedek and Itai [4] proved that a concept class $C$ is
learnable for a fixed distribution $P$ iff $C$ has finite metric entropy with respect to $P$. We state their
results formally in the following theorem, which we have written in a form analogous to Theorem 1.
Theorem 2 Let $C$ be a concept class and $P$ be a fixed and known probability measure. The following
are equivalent:

(i) The metric entropy of $C$ with respect to $P$ is finite for all $\epsilon > 0$.

(ii) $C$ is learnable with respect to $P$.

(iii) If $N(\epsilon) = N(\epsilon, C, P)$ is the size of a minimal $\epsilon$-approximation
of $C$ with respect to $P$ and $\{y_1, \ldots, y_{N(\epsilon/2)}\}$ is an $\epsilon/2$-approximation
to $C$ then

(a) for sample size greater than $(32/\epsilon) \ln(N(\epsilon/2)/\delta)$,
any function $f : S_C \to \{y_1, \ldots, y_{N(\epsilon/2)}\}$ which minimizes the number
of disagreements on the samples is a learning algorithm for
$C$, and

(b) for sample size less than $\log_2[(1 - \delta) N(2\epsilon)]$ no function
$f \in F_{C,H}$ is a learning algorithm for $C$.
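
The learning rule in (iii)(a) can be sketched as follows, assuming a finite approximating set is already available. The names `disagreements` and `min_disagreement_learner`, the eight-point instance space, and the particular cover are hypothetical choices for this illustration, not the construction of [4].

```python
def disagreements(h, labeled):
    """Number of labeled samples (x, a) on which hypothesis h disagrees."""
    return sum(int(x in h) != a for x, a in labeled)

def min_disagreement_learner(cover, labeled):
    """Return a member of the finite cover with the fewest disagreements."""
    return min(cover, key=lambda h: disagreements(h, labeled))

# Instance space {0, ..., 7}; concepts are initial segments {0, ..., t-1}.
# Under the uniform distribution, the segments with even t are within 1/8
# of every initial segment, so they form a finite approximating set.
cover = [frozenset(range(t)) for t in range(0, 9, 2)]
target = frozenset(range(3))
labeled = [(x, int(x in target)) for x in range(8)]

h = min_disagreement_learner(cover, labeled)
```

The target itself is not in the cover, so the returned hypothesis is not consistent with the sample; it is only the closest available approximation, which is all the theorem requires.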

Note that in condition (iii)(a), only functions whose range was a finite $\epsilon/2$-approximation to
$C$ were considered. As noted in [4], a function that simply returns some concept consistent with
the samples does not necessarily learn. In fact, they claim that they found examples where for
every finite sample there are concepts $\epsilon$-far from the target concept (even with $\epsilon = 1$) that are still
consistent with the samples. The following is a simple example which substantiates their claim.
Let $X = [0, 1]$, $P$ be the uniform distribution on $X$, and $C$ be the concept class containing all finite
sets of points and the entire unit interval. That is,

$C = \{\{x_1, \ldots, x_r\} : 1 \le r < \infty \text{ and } x_i \in [0, 1],\ i = 1, \ldots, r\} \cup \{[0, 1]\}$

If the target concept is $[0, 1]$ then for every finite sample there are many concepts that are consistent
with the sample but are $\epsilon$-far (with $\epsilon = 1$) from $[0, 1]$. Namely, any finite set of points which contains
the points of the sample is a concept with this property.
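
A small simulation of this example, under the assumption that pseudo-random floats stand in for uniform draws from $[0, 1]$, shows a consistent hypothesis that misclassifies essentially every fresh point. The function name and sample sizes are illustrative.

```python
import random

def consistent_finite_hypothesis(labeled):
    """Return the finite set of positively labeled sample points."""
    return frozenset(x for x, a in labeled if a == 1)

random.seed(0)
xs = [random.random() for _ in range(1000)]
labeled = [(x, 1) for x in xs]  # the target [0, 1] labels every point positive

h = consistent_finite_hypothesis(labeled)

# h agrees with every sample, but it has probability zero under the uniform
# distribution, so fresh points are almost surely classified incorrectly.
fresh = [random.random() for _ in range(1000)]
empirical_error = sum(x not in h for x in fresh) / len(fresh)
```

Increasing the sample size does not help: the hypothesis remains a finite set, so its error with respect to the uniform distribution stays equal to 1.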
3. RELATIONSHIPS BETWEEN METRIC ENTROPY AND THE
VAPNIK-CHERVONENKIS DIMENSION


In  the  previous  section,  we  stated  a  result  from  [6]  which  showed  that  the  VC  dimension 
of  a  concept  class  characterizes  learnability  for  every  distribution.  A  similar  result  from  [4]  was 
stated  which  showed  that  the  metric  entropy  of  a  concept  class  characterizes  learnability  for  a  fixed 
distribution.  These  two  results  naturally  suggest  that  there  may  be  some  relationships  between  the 
VC  dimension  of  a  concept  class  and  its  metric  entropy  with  respect  to  various  distributions.  This 
is  indeed  the  case.  In  this  section,  we  discuss  some  relationships  explored  by  [4],  prove  a  further 
result,  and  state  an  earlier  result  from  [8]. 

The  following  theorem  was  shown  in  [4],  and  is  stated  as  it  appeared  there. 

Theorem 3 Let $C$ be a concept class of finite dimension $d \ge 1$ and let $N(\epsilon, C, P)$ be the size of a
minimum $\epsilon$-cover of $C$ with respect to probability measure $P$. Then the following relations hold:

(i) There is a distribution $P$ such that $\lfloor \log_2 d \rfloor \le N(\frac{1}{4}, C, P)$.

(ii) If $\epsilon \le \frac{1}{2d}$ then there is a distribution $P$ such that $2^d \le N(\epsilon, C, P)$.

(iii) If $\epsilon \le \frac{1}{2}$ then $N(\epsilon, C, P) \le 1.002\,(16d/\epsilon)^{16d/\epsilon}$ for every probability
measure $P$.

The proofs of these relations are straightforward and were given in [4]. However, some comments
on each of these relations are in order.

First,  a  statement  more  general  than  (ii)  can  be  made  which  does  not  depend  on  the  VC 
dimension of $C$. Specifically, let $x_1, \ldots, x_n \in X$ be distinct points and let $c_1, \ldots, c_k \in C$ be
concepts whose intersection with $\{x_1, \ldots, x_n\}$ gives rise to distinct subsets, i.e., $c_i \cap \{x_1, \ldots, x_n\} \ne
c_j \cap \{x_1, \ldots, x_n\}$ for $i \ne j$. Note that we must necessarily have $k \le 2^n$. If we take $P$ to be the uniform
distribution on $\{x_1, \ldots, x_n\}$ then we obtain $N(\epsilon, C, P) \ge k$ for $\epsilon \le \frac{1}{2n}$. This reduces to (ii) if $C$ has
VC dimension $d$ and $c_1, \ldots, c_{2^d}$ are concepts which shatter the set of points $\{x_1, \ldots, x_d\}$. However,
our statement is more general since, regardless of the VC dimension of $C$, it may be possible to find
$n$ concepts which give rise to $n$ distinct subsets of $\{x_1, \ldots, x_n\}$ so that $N(\epsilon, C, P) \ge n$ for $\epsilon \le \frac{1}{2n}$.
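
The counting behind this remark can be sketched for a finite example: concepts with distinct restrictions to $n$ points are at least $1/n$ apart under the uniform distribution on those points, so no open ball of radius at most $1/(2n)$ contains two of them. The function names and the choice of initial segments as the concept class are assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations

def traces(C, points):
    """Distinct subsets of `points` obtained by intersecting with concepts in C."""
    pts = frozenset(points)
    return {pts & c for c in C}

def min_pairwise_distance(C, points):
    """Smallest distance between distinct traces, uniform distribution on points."""
    n = len(points)
    T = traces(C, points)
    return min(Fraction(len(a ^ b), n) for a, b in combinations(T, 2))

C = {frozenset(range(t)) for t in range(5)}  # initial segments of {0, ..., 3}
points = (0, 1, 2)

k = len(traces(C, points))                  # number of distinct restrictions
gap = min_pairwise_distance(C, points)      # at least 1/n by construction
```

In this example $k = 4$ and the traces are at least $1/3$ apart, so $N(\epsilon, C, P) \ge 4$ for $\epsilon \le 1/6$ even though the class has VC dimension 1.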

The  result  in  (iii)  was  obtained  somewhat  indirectly  in  [4]  by  using  upper  and  lower  bounds 
for  the  number  of  samples  required  for  learning  (from  [6]  and  [4]  respectively).  The  following  result 
along the lines of (iii) was shown in [8]. Note that the bound does not appear exactly as in [8] since
the definition of VC dimension used in [8] corresponds to $d + 1$.

Proposition 2 If $C$ is a concept class with VC dimension $d$, then there is a constant $K = K(d)$
such that for $0 < \epsilon \le \frac{1}{2}$ we have $N(\epsilon, C, P) \le K(d)\,\epsilon^{-(d+1)} |\ln \epsilon|^{d+1}$ for every probability measure $P$.

For a fixed concept class (and hence fixed $d$), this bound provides a much tighter bound on
$N(\epsilon, C, P)$ as a function of $\epsilon$ than the bound of (iii) (namely, polynomial vs exponential in $1/\epsilon$).
Now, regarding relation (i), we note that if the VC dimension of C is infinite then we can find
a sequence of distributions $P_n$ for $n = 1, 2, \ldots$ such that $\lim_{n \to \infty} N(\frac{1}{4}, C, P_n) = \infty$. Relation (i) is
proved by considering the uniform distribution on a finite set of $d$ points shattered by $C$. If the
VC dimension of $C$ is infinite, our comment follows by taking $P_n$ to be the uniform distribution
over $n$ points shattered by $C$ and using (the proof of) relation (i) for each $n = 1, 2, \ldots$. In general,
for a concept class of infinite VC dimension, we may not necessarily be able to find a particular
distribution $P$ for which $N(\epsilon, C, P) = \infty$, but will only be able to approach infinite metric entropy by
a sequence of distributions. However, in some cases we can achieve infinite metric entropy as shown
by the following example. Let $X = [0, 1]$ and let $C$ be the set of all Borel sets. Then taking $P$ to be
the uniform distribution, we have $N(\frac{1}{4}, C, P) = \infty$ since the infinite collection of sets corresponding
to the Haar basis functions (i.e., $C_n = \{x \in [0, 1] : \text{the } n\text{th digit in the binary expansion of } x \text{ is } 1\}$)
are pairwise a distance $\frac{1}{2}$ apart with respect to $P$.
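
The pairwise distance in this example can be verified exactly for the first few digit sets by working on a dyadic grid, where each cell has probability $2^{-K}$ under the uniform distribution. The grid depth `K` and the helper names are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import combinations

K = 6  # number of binary digits considered; each dyadic cell has measure 2**-K

def digit(j, n):
    """n-th binary digit (1-indexed) shared by all x in [j/2**K, (j+1)/2**K)."""
    return (j >> (K - n)) & 1

def distance(n, m):
    """P(C_n symmetric-difference C_m) under the uniform distribution."""
    differing = sum(digit(j, n) != digit(j, m) for j in range(2 ** K))
    return Fraction(differing, 2 ** K)

pairwise = {(n, m): distance(n, m) for n, m in combinations(range(1, K + 1), 2)}
```

Two distinct digits disagree on exactly half of the cells, so every entry of `pairwise` equals $1/2$; two sets this far apart cannot both lie in one open ball of radius $1/4$.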

Finally, we prove a result which has a larger range of applicability than (ii) and gives a stronger
dependence on $d$ than (i) for $\epsilon \le \frac{1}{4}$. Although the bound of (ii) is exponential in $d$, it is valid only
for $\epsilon \le \frac{1}{2d}$, so that the range of applicability goes to zero as $d \to \infty$. On the other hand, (i) is valid
for a fixed $\epsilon$ independent of $d$ (namely $\epsilon = \frac{1}{4}$) but gives only logarithmic dependence on $d$. The
following bound gives exponential dependence on $d$ for a fixed range of applicability ($\epsilon \le \frac{1}{4}$).

Proposition 3 If $C$ is a concept class of finite dimension $d \ge 1$ then there is a probability measure
$P$ such that

$e^{2(\frac{1}{2} - 2\epsilon)^2 d} \le N(\epsilon, C, P)$

for all $\epsilon \le \frac{1}{4}$.

Proof: Let $\{x_1, \ldots, x_d\}$ be a set of $d$ points that is shattered by $C$, and let $P$ be the uniform
distribution on $\{x_1, \ldots, x_d\}$, i.e., $P(x_i) = \frac{1}{d}$ for $i = 1, \ldots, d$. For this distribution, the only relevant
property of a concept $c$ is the set of $x_i$ which are contained in $c$. Hence, we can represent $c$ by a $d$
bit binary string with a one in position $i$ indicating that $x_i \in c$, and we can identify the concept
class $C$ with the set of all $d$ bit binary strings.

If we can find $n$ concepts that are pairwise more than $2\epsilon$ apart, then $N(\epsilon, C, P) \ge n$ since each
of the non-overlapping $\epsilon$-balls around these $n$ concepts must contain a member of an $\epsilon$-cover (see
Lemma 3 of [4]). Given two concepts $c_1, c_2$ represented as binary strings, $d_P(c_1, c_2) = \frac{k}{d}$ where $k$ is
the number of bits on which $c_1$ and $c_2$ differ, and so $d_P(c_1, c_2) \le 2\epsilon$ iff $c_2$ differs from $c_1$ on $k \le 2\epsilon d$
bits. The number of binary strings that differ on $k$ bits from a given string is $\binom{d}{k}$. Therefore, the
number of concepts that are a distance less than or equal to $2\epsilon$ from a given concept is $\sum_{0 \le k \le 2\epsilon d} \binom{d}{k}$.
Since the total number of concepts is $2^d$, we can find at least

$2^d \Big/ \sum_{0 \le k \le 2\epsilon d} \binom{d}{k}$

concepts that are more than $2\epsilon$ apart, so that

$N(\epsilon, C, P) \ge 2^d \Big/ \sum_{0 \le k \le 2\epsilon d} \binom{d}{k}$
Now, Dudley [8] states the Chernoff-Okamoto inequality

$\sum_{0 \le k \le m} \binom{n}{k} p^k (1 - p)^{n-k} \le e^{-(np - m)^2 / [2np(1-p)]}$

for $p \le \frac{1}{2}$ and $m \le np$, which can be obtained from a more general inequality (for sums of bounded
random variables) of Hoeffding [13]. Taking $n = d$, $p = \frac{1}{2}$, and $m = 2\epsilon d$ we obtain

$\sum_{0 \le k \le 2\epsilon d} \binom{d}{k} \le 2^d e^{-2(\frac{1}{2} - 2\epsilon)^2 d}$

for $\epsilon \le \frac{1}{4}$. Using this in our earlier bound on $N(\epsilon, C, P)$, we get

$N(\epsilon, C, P) \ge e^{2(\frac{1}{2} - 2\epsilon)^2 d}$

for $\epsilon \le \frac{1}{4}$, which is the desired inequality. ∎
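
The chain of inequalities in the proof can be checked numerically for small $d$. The sketch below compares the counting bound $2^d / \sum_{0 \le k \le 2\epsilon d} \binom{d}{k}$ with the exponential bound $e^{2(1/2 - 2\epsilon)^2 d}$; the function names and the grid of test values are assumptions for illustration.

```python
from math import comb, exp, floor

def packing_lower_bound(d, eps):
    """2^d divided by the number of d-bit strings within 2*eps*d bits of a given one."""
    ball = sum(comb(d, k) for k in range(0, floor(2 * eps * d) + 1))
    return 2 ** d / ball

def exponential_bound(d, eps):
    """The closed-form lower bound exp(2 * (1/2 - 2*eps)^2 * d)."""
    return exp(2 * (0.5 - 2 * eps) ** 2 * d)

# Check the inequality on a small grid of dimensions and accuracies.
checks = [
    packing_lower_bound(d, eps) >= exponential_bound(d, eps)
    for d in range(1, 41)
    for eps in (0.05, 0.1, 0.2, 0.25)
]
```

At $\epsilon = 1/4$ the exponential bound degenerates to 1, which marks the end of the range in which the proposition is informative.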
4. PARTIAL RESULTS ON LEARNABILITY FOR A CLASS OF
DISTRIBUTIONS


In  this  section,  we  prove  some  partial  results  regarding  learnability  for  a  class  of  distributions. 
The  definition  of  learnability  in  this  case  is  completely  analogous  to  the  definitions  given  earlier, 
but  for  completeness  we  state  it  formally. 

Definition 5 (Learnability for a Class of Distributions) Let $\mathcal{P}$ be a fixed and known collection
of probability measures. The pair $(C, H)$ is said to be learnable with respect to $\mathcal{P}$ if there
exists a function $f \in F_{C,H}$ such that for every $\epsilon, \delta > 0$ there is a $0 < m < \infty$ such that for every
probability measure $P \in \mathcal{P}$ and every $c \in C$, if $x \in X^m$ is chosen at random according to $P^m$ then
the probability that $\mathrm{error}_{f,c,P}(x) < \epsilon$ is greater than $1 - \delta$.

Benedek  and  Itai  [4]  posed  the  problem  of  characterizing  learnability  for  a  class  of  distributions 
as  an  open  problem,  and  they  made  the  following  conjecture. 

Conjecture 1 A concept class $C$ is learnable with respect to a class of distributions $\mathcal{P}$ iff for every
$\epsilon > 0$,

$N(\epsilon, C, \mathcal{P}) = \sup_{P \in \mathcal{P}} N(\epsilon, C, P) < \infty$

The notation defined in the statement of the conjecture will be used throughout. Namely, if $\mathcal{P}$ is
any class of distributions, then $N(\epsilon, C, \mathcal{P})$ is defined by $N(\epsilon, C, \mathcal{P}) = \sup_{P \in \mathcal{P}} N(\epsilon, C, P)$.

For a single distribution, the conjecture reduces immediately to the known result of [4]
(stated in Section 2). For every distribution, the results of Section 3 imply that the condition
$\sup_{\text{all } P} N(\epsilon, C, P) < \infty\ \forall \epsilon > 0$ is equivalent to the condition that $C$ have finite VC dimension.
Hence, the conjecture in this case reduces to the known result of [6] (stated in Section 2). As pointed
out in [4], the case where $\mathcal{P}$ is finite is similar to the case of a single distribution, and the case
where $\mathcal{P}$ contains all discrete distributions is similar to the case of all distributions. The result for
all discrete distributions follows again from Section 3 since $\sup_{\text{discrete } P} N(\epsilon, C, P) < \infty\ \forall \epsilon > 0$ iff
the VC dimension of $C$ is finite.

We  now  prove  some  results  for  more  general  classes  of  distributions.  Although  our  results  are 
far  from  verifying  the  conjecture  completely,  the  partial  results  we  obtain  are  consistent  with  it. 
They  also  provide  some  indication  of  when  the  set  of  learnable  concept  classes  is  or  is  not  enlarged 
by  requiring  learnability  for  only  a  class  of  distributions. 

One natural extension to considering a single distribution $P_0$ is to consider the class of all
distributions sufficiently close to $P_0$. One measure of proximity of distributions is the total variation
defined as follows. First, we assume that we are working with some fixed $\sigma$-algebra $S$ of $X$. Let
$\mathcal{P}^*$ denote the set of all probability measures defined on $S$. For $P_1, P_2 \in \mathcal{P}^*$, the total variation
between $P_1$ and $P_2$ is defined as

$\|P_1 - P_2\| = \sup_{A \in S} |P_1(A) - P_2(A)|$
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For  a  given  distribution  Po  and  0  <  A  <  1  define 

PV(P0,  X)  =  {P€V:  ||P  -Po!i<  A} 

Vv(Po,  A)  represents  all  probability  measures  which  are  within  A  of  Po  in  total  variation.  For  A  =  0, 
'Pv(PotO)  contains  only  the  distribution  Po,  and  for  A  =  1,  Vv(Po,  1)  contains  all  distributions. 
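To make the definition concrete, the following is a minimal sketch for the special case of a finite sample space, where every subset is an event. The dictionary representation of a distribution and the function names are assumptions made for illustration; they are not part of the framework above. On a finite space the supremum is attained at the event where one distribution exceeds the other, so the total variation equals half the $L_1$ distance between the probability vectors; the brute-force version is included only to check this against the definition.

```python
from itertools import combinations


def total_variation(p, q):
    """Total variation sup_A |P(A) - Q(A)| for distributions on a finite set.

    p and q are dicts mapping outcomes to probabilities (a representation
    assumed for this sketch).
    """
    support = set(p) | set(q)
    # The supremum is attained at A = {x : p(x) > q(x)}, which gives
    # half the L1 distance between the two probability vectors.
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def total_variation_brute_force(p, q):
    """Same quantity, computed by enumerating every event A (small spaces only)."""
    support = sorted(set(p) | set(q))
    best = 0.0
    for r in range(len(support) + 1):
        for event in combinations(support, r):
            p_mass = sum(p.get(x, 0.0) for x in event)
            q_mass = sum(q.get(x, 0.0) for x in event)
            best = max(best, abs(p_mass - q_mass))
    return best


def in_tv_ball(p, p0, lam, tol=1e-12):
    """Membership test for the total variation ball of radius lam around p0."""
    return total_variation(p, p0) <= lam + tol
```

With radius 0 only $P_0$ itself passes the membership test, and with radius 1 every distribution does, matching the two extreme cases noted above.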

Another possibility for generating a class of distributions from $P_0$ utilizes the property that a
convex combination of two probability measures is also a probability measure. Specifically, if $P_1$
and $P_2$ are probability measures then $\lambda P_1 + (1 - \lambda) P_2$ is also a probability measure for
$0 \le \lambda \le 1$. One interpretation of this convex combination is that with probability $\lambda$ a point is
drawn according to $P_1$, and with probability $1 - \lambda$ the point is drawn according to $P_2$. Given a
distribution $P_0$ and $0 \le \lambda \le 1$, define

$$\mathcal{P}_l(P_0, \lambda) = \{(1 - \eta) P_0 + \eta P : \eta \le \lambda,\ P \in \mathcal{P}^*\}$$

The distributions in $\mathcal{P}_l(P_0, \lambda)$ can be thought of as those obtained by using $P_0$ with probability
greater than or equal to $1 - \lambda$ and using an arbitrary distribution otherwise. Note that, as with
$\mathcal{P}_v(P_0, \lambda)$, we have $\mathcal{P}_l(P_0, 0) = \{P_0\}$ and $\mathcal{P}_l(P_0, 1) = \mathcal{P}^*$.
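The sampling interpretation of the convex combination can be sketched as follows for distributions on a finite set. The dictionary representation and the names `mix` and `sample_mixture` are hypothetical choices for this illustration only.

```python
import random


def mix(p0, p, eta):
    """Return the convex combination (1 - eta) * p0 + eta * p as a dict."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    support = set(p0) | set(p)
    return {x: (1.0 - eta) * p0.get(x, 0.0) + eta * p.get(x, 0.0) for x in support}


def sample_mixture(p0, p, eta, rng):
    """Draw one point: with probability eta use p, otherwise use p0."""
    source = p if rng.random() < eta else p0
    outcomes = list(source)
    weights = [source[x] for x in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]
```

For example, `sample_mixture(p0, p, 0.1, random.Random(0))` draws from `p0` roughly nine times out of ten; drawing this way has the same law as sampling directly from `mix(p0, p, 0.1)`.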

Both $\mathcal{P}_l(P_0, \lambda)$ and $\mathcal{P}_v(P_0, \lambda)$ can be thought of as “spheres” of distributions centered at $P_0$,
i.e., all distributions sufficiently “close” to $P_0$ in an appropriate sense. The following proposition
verifies the conjecture for $\mathcal{P}_l(P_0, \lambda)$ and $\mathcal{P}_v(P_0, \lambda)$ and shows that a concept class is learnable
for $\mathcal{P}_l(P_0, \lambda)$ or $\mathcal{P}_v(P_0, \lambda)$ with $\lambda > 0$ iff it is learnable for all distributions.

Proposition 4 Let C be a concept class, $P_0$ a fixed distribution, and $0 < \lambda \le 1$. Then the
following are equivalent:

(i) $N(\epsilon, C, \mathcal{P}_l(P_0, \lambda)) < \infty$ for all $\epsilon > 0$

(ii) C has finite VC dimension

(iii) C is learnable for $\mathcal{P}_l(P_0, \lambda)$

Furthermore, $\mathcal{P}_l(P_0, \lambda) \subset \mathcal{P}_v(P_0, \lambda)$ so that the above are equivalent for $\mathcal{P}_v(P_0, \lambda)$ as well.

Proof: (ii) $\Rightarrow$ (iii) This follows from the results of [6] (what we have called Theorem 1). Namely,
(ii) implies learnability for all distributions, which implies learnability for $\mathcal{P}_l(P_0, \lambda) \subset \mathcal{P}^*$.

(iii) $\Rightarrow$ (i) If $N(\epsilon, C, \mathcal{P}_l(P_0, \lambda)) = \infty$ for some $\epsilon > 0$, then for every $M < \infty$ there exists
$P_M \in \mathcal{P}_l(P_0, \lambda)$ such that $N(\epsilon, C, P_M) > M$. But then by the results of [4] (what we have called
Theorem 2), more than $\log_2 N(\epsilon, C, P_M) > \log_2 (1 - \delta)M$ samples are required to learn for $P_M$.
Since M is arbitrary, letting $M \to \infty$ contradicts the fact that C is learnable for $\mathcal{P}_l(P_0, \lambda)$. Thus,
$N(\epsilon, C, \mathcal{P}_l(P_0, \lambda)) < \infty$ for all $\epsilon > 0$.


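(i) $\Rightarrow$ (ii) For every $P \in \mathcal{P}^*$, let $Q = (1 - \lambda) P_0 + \lambda P \in \mathcal{P}_l(P_0, \lambda)$. If $c_1, c_2 \subset X$ are any
measurable sets, then

$$d_Q(c_1, c_2) = Q(c_1 \Delta c_2) = (1 - \lambda) P_0(c_1 \Delta c_2) + \lambda P(c_1 \Delta c_2) \ge \lambda P(c_1 \Delta c_2) = \lambda\, d_P(c_1, c_2)$$

Hence any $\lambda\epsilon$-approximation of C with respect to $d_Q$ is also an $\epsilon$-approximation of C with respect
to $d_P$. Therefore, $N(\lambda\epsilon, C, Q) \ge N(\epsilon, C, P)$ and so

$$N(\epsilon, C, \mathcal{P}^*) = \sup_{P \in \mathcal{P}^*} N(\epsilon, C, P) \le \sup_{P \in \mathcal{P}^*} N(\lambda\epsilon, C, (1 - \lambda) P_0 + \lambda P) = \sup_{Q \in \mathcal{P}_l(P_0, \lambda)} N(\lambda\epsilon, C, Q) < \infty$$

Hence, from the results of Section 3, C has finite VC dimension.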

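Finally, to show $\mathcal{P}_l(P_0, \lambda) \subset \mathcal{P}_v(P_0, \lambda)$, let $Q \in \mathcal{P}_l(P_0, \lambda)$. Then $Q = (1 - \eta) P_0 + \eta P$ for some
$P \in \mathcal{P}^*$ and $\eta \le \lambda$. For every $A \in S$, we have

$$|Q(A) - P_0(A)| = |(1 - \eta) P_0(A) + \eta P(A) - P_0(A)| = \eta\, |P(A) - P_0(A)| \le \eta \le \lambda$$

Therefore, $\|Q - P_0\| \le \lambda$ so that $Q \in \mathcal{P}_v(P_0, \lambda)$. □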
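The two inequalities that carry this proof can be checked numerically on a small finite sample space. The sketch below is only an illustration under assumed representations (distributions as dicts, concepts as frozensets, all subsets treated as measurable); the helper names are hypothetical. It verifies that for $Q = (1 - \lambda) P_0 + \lambda P$ the distance between any two concepts under $Q$ is at least $\lambda$ times their distance under $P$, and that $Q$ lies within $\lambda$ of $P_0$ in total variation.

```python
from itertools import combinations


def sym_diff_mass(p, c1, c2):
    """d_P(c1, c2): probability under p of the symmetric difference of c1 and c2."""
    return sum(p.get(x, 0.0) for x in set(c1) ^ set(c2))


def mix(p0, p, eta):
    """Convex combination (1 - eta) * p0 + eta * p."""
    support = set(p0) | set(p)
    return {x: (1.0 - eta) * p0.get(x, 0.0) + eta * p.get(x, 0.0) for x in support}


def total_variation(p, q):
    """Total variation on a finite space (half the L1 distance)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def all_subsets(domain):
    """Every subset of a finite domain, as frozensets."""
    items = sorted(domain)
    return [frozenset(s) for r in range(len(items) + 1) for s in combinations(items, r)]


def check_sphere_inequalities(p0, p, lam, tol=1e-12):
    """Return (metric_ok, ball_ok) for Q = (1 - lam) * p0 + lam * p."""
    q = mix(p0, p, lam)
    concepts = all_subsets(set(p0) | set(p))
    # d_Q(c1, c2) >= lam * d_P(c1, c2) for every pair of concepts.
    metric_ok = all(
        sym_diff_mass(q, c1, c2) >= lam * sym_diff_mass(p, c1, c2) - tol
        for c1 in concepts
        for c2 in concepts
    )
    # ||Q - P0|| <= lam, i.e. Q lies in the total variation ball around P0.
    ball_ok = total_variation(q, p0) <= lam + tol
    return metric_ok, ball_ok
```

A check of this kind does not replace the proof; it only confirms the algebra on one example.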

The following result shows that learnability of a concept class is retained under finite unions
of distribution classes. That is, if a concept class C is learnable for a finite number of sets of
distributions $\mathcal{P}_1, \ldots, \mathcal{P}_n$ then it is learnable with respect to their union $\mathcal{P} = \bigcup_{i=1}^{n} \mathcal{P}_i$. This is to be
expected if the conjecture is true since $N(\epsilon, C, \mathcal{P}) = \max_i N(\epsilon, C, \mathcal{P}_i) < \infty$ iff $N(\epsilon, C, \mathcal{P}_i) < \infty$ for
$i = 1, \ldots, n$.

Proposition 5 Let C be a concept class, and let $\mathcal{P}_1, \ldots, \mathcal{P}_n$ be n sets of distributions. If C is
learnable with respect to $\mathcal{P}_i$ for $i = 1, \ldots, n$ then C is learnable with respect to $\bigcup_{i=1}^{n} \mathcal{P}_i$.

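Proof: Let $f_i$ be an algorithm which learns C with respect to $\mathcal{P}_i$, and let $m_i(\epsilon, \delta)$ be the number of
samples required by $f_i$ to learn with accuracy $\epsilon$ and confidence $\delta$. Define an algorithm $f$ as follows.
Ask for

$$m(\epsilon, \delta) = \max_i m_i\!\left(\tfrac{\epsilon}{2}, \tfrac{\delta}{2}\right) + \frac{32}{\epsilon} \ln \frac{2n}{\delta}$$

samples. Using the first $\max_i m_i(\frac{\epsilon}{2}, \frac{\delta}{2})$ samples, form hypotheses $h_1, \ldots, h_n$ using algorithms
$f_1, \ldots, f_n$ respectively. Then, using the last $\frac{32}{\epsilon} \ln \frac{2n}{\delta}$ samples, let $f$ output the hypothesis $h_i$
which is inconsistent with the smallest number of this second group of samples. We claim that $f$
is a learning algorithm for C with respect to $\bigcup_{i=1}^{n} \mathcal{P}_i$.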

Let $P \in \bigcup_{i=1}^{n} \mathcal{P}_i$, and let $c \in C$. Then $P \in \mathcal{P}_k$ for some k. Since the $f_i$ are learning algorithms
with respect to the $\mathcal{P}_i$, at least one $h_i$ (namely $h_k$) is within $\frac{\epsilon}{2}$ of c with probability (with respect
to product measures of P) greater than $1 - \frac{\delta}{2}$. Given that $h_i$ is within $\frac{\epsilon}{2}$ of c for some i, the proof
of Lemma 4 from [4] shows that the most consistent hypothesis (on the second group of samples)
is within $\epsilon$ of c with probability greater than $1 - \frac{\delta}{2}$. Therefore, if A denotes the event that at least
one $h_i$ is within $\frac{\epsilon}{2}$ of c then

$$\Pr\{d_P(f(\mathrm{sam}_c(x)), c) < \epsilon\} \ge \Pr\{d_P(f(\mathrm{sam}_c(x)), c) < \epsilon \mid A\} \cdot \Pr\{A\} > \left(1 - \tfrac{\delta}{2}\right)\left(1 - \tfrac{\delta}{2}\right) > 1 - \delta$$

Thus, $f$ is a learning algorithm for C with respect to $\bigcup_{i=1}^{n} \mathcal{P}_i$ using $m(\epsilon, \delta)$ samples. □
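The two-stage algorithm in this proof can be sketched in code. The sketch assumes a second-stage sample size of the form $(32/\epsilon) \ln(2n/\delta)$ and assumes the first-stage sizes $m_i(\epsilon/2, \delta/2)$ are supplied by the caller; the function names, the representation of hypotheses as Boolean-valued callables, and the `draw_labeled` oracle are all hypothetical choices for illustration.

```python
import math


def second_stage_size(eps, delta, n):
    """Assumed validation sample size (32 / eps) * ln(2n / delta), rounded up."""
    return math.ceil((32.0 / eps) * math.log(2.0 * n / delta))


def disagreements(h, labeled):
    """Number of labeled examples (x, y) on which hypothesis h is inconsistent."""
    return sum(1 for x, y in labeled if h(x) != y)


def union_learner(learners, first_stage_sizes, eps, delta, draw_labeled):
    """Two-stage learner for a finite union of distribution classes.

    learners          -- list of functions: labeled sample list -> hypothesis
    first_stage_sizes -- assumed values of m_i(eps/2, delta/2), one per learner
    draw_labeled      -- function m -> list of m labeled examples (x, c(x))
    """
    n = len(learners)
    m1 = max(first_stage_sizes)
    m2 = second_stage_size(eps, delta, n)
    sample = draw_labeled(m1 + m2)
    train, holdout = sample[:m1], sample[m1:]
    # Stage 1: every learner forms a hypothesis from the same first group.
    hypotheses = [learn(train) for learn in learners]
    # Stage 2: keep the hypothesis least inconsistent with the second group.
    return min(hypotheses, key=lambda h: disagreements(h, holdout))
```

The selection step relies on there being only n candidate hypotheses, which is why the argument is restricted to finite unions.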

Note that the above result is not true in general for an infinite number of classes of distributions
since the sample complexity of the corresponding algorithms may be unbounded (i.e., we may have
$\sup_i N(\epsilon, C, \mathcal{P}_i) = \infty$). However, even if $N(\epsilon, C, \mathcal{P}_i)$ is uniformly bounded the proof above does not
go through since the application of Lemma 4 from [4] requires finitely many hypotheses. This is
essentially the difficulty encountered in attempting to prove the conjecture directly.

For a finite number of distributions $P_1, \ldots, P_n$, define their convex hull, denoted by
$\mathrm{conv}(P_1, \ldots, P_n)$, as the set of distributions that can be written as a convex combination of
$P_1, \ldots, P_n$. That is,

$$\mathrm{conv}(P_1, \ldots, P_n) = \{\lambda_1 P_1 + \cdots + \lambda_n P_n : 0 \le \lambda_i \le 1 \text{ and } \lambda_1 + \cdots + \lambda_n = 1\}$$

We  now  prove  the  following  proposition. 

Proposition 6 Let C be a concept class and let $P_1, \ldots, P_n$ be probability measures. The following
are equivalent:

(i) C is learnable with respect to $P_i$ for each $i = 1, \ldots, n$.

(ii) $N(\epsilon, C, \mathrm{conv}(P_1, \ldots, P_n)) < \infty$ for all $\epsilon > 0$.

(iii) C is learnable with respect to $\mathrm{conv}(P_1, \ldots, P_n)$.

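Proof: (iii) $\Rightarrow$ (i) This is immediate.

(i) $\Rightarrow$ (ii) Since C is learnable with respect to $P_i$ for each i, by Theorem 2 we have
$N(\epsilon, C, P_i) < \infty$ for all $\epsilon > 0$ and $i = 1, \ldots, n$. Let $N_i(\epsilon) = N(\epsilon, C, P_i)$ and let
$c_{i,1}, \ldots, c_{i,N_i(\epsilon/2)}$ be an $\frac{\epsilon}{2}$-approximation of C with respect to $d_{P_i}$. For each $i = 1, \ldots, n$, let
$C_{i,j} = \{c \in C : d_{P_i}(c, c_{i,j}) \le \frac{\epsilon}{2}\}$ for $j = 1, \ldots, N_i(\frac{\epsilon}{2})$. We have $C = \bigcup_{j=1}^{N_i(\epsilon/2)} C_{i,j}$ for all
$i = 1, \ldots, n$. Let

$$C_{k_1, \ldots, k_n} = \bigcap_{i=1}^{n} C_{i,k_i}$$

for $1 \le k_i \le N_i(\frac{\epsilon}{2})$, $i = 1, \ldots, n$. Clearly,

$$C = \bigcup_{\text{all } (k_1, \ldots, k_n)} C_{k_1, \ldots, k_n}.$$
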

Also, by construction the ‘diameter’ of each $C_{k_1, \ldots, k_n}$ with respect to $d_{P_i}$ is less than or equal to $\epsilon$
for all $i = 1, \ldots, n$, i.e. for each $i = 1, \ldots, n$ we have

$$\sup_{c_1, c_2 \in C_{k_1, \ldots, k_n}} d_{P_i}(c_1, c_2) \le \epsilon$$

Hence, if we define a metric $\rho(\cdot, \cdot)$ by

$$\rho(c_1, c_2) = \max_{1 \le i \le n} d_{P_i}(c_1, c_2)$$

then $N(\epsilon, C, \rho) \le \prod_{i=1}^{n} N_i(\frac{\epsilon}{2}) < \infty$ since we can form an $\epsilon$-approximation of C with respect to $\rho$ by
simply taking any point from each $C_{k_1, \ldots, k_n}$ that is nonempty.

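Now, if $Q \in \mathrm{conv}(P_1, \ldots, P_n)$ then $Q = \sum_{i=1}^{n} \lambda_i P_i$ for some $0 \le \lambda_i \le 1$ with $\sum_{i=1}^{n} \lambda_i = 1$. For
any measurable $c_1, c_2 \subset X$, we have

$$d_Q(c_1, c_2) = \sum_{i=1}^{n} \lambda_i\, d_{P_i}(c_1, c_2) \le \Big(\sum_{i=1}^{n} \lambda_i\Big) \max_{1 \le i \le n} d_{P_i}(c_1, c_2) = \rho(c_1, c_2)$$

so that $N(\epsilon, C, Q) \le N(\epsilon, C, \rho)$. Thus,

$$N(\epsilon, C, \mathrm{conv}(P_1, \ldots, P_n)) = \sup_{Q \in \mathrm{conv}(P_1, \ldots, P_n)} N(\epsilon, C, Q) \le \prod_{i=1}^{n} N_i\!\left(\tfrac{\epsilon}{2}\right) < \infty$$
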

(ii) $\Rightarrow$ (iii) If $N(\epsilon, C, \mathrm{conv}(P_1, \ldots, P_n)) < \infty$ for all $\epsilon > 0$, then, in particular, $N(\epsilon, C, P_i) < \infty$
for $i = 1, \ldots, n$ and $\epsilon > 0$. Therefore, we can employ the construction used above in proving that
(i) implies (ii) to get a finite $\frac{\epsilon}{2}$-approximation of C uniformly for all $Q \in \mathrm{conv}(P_1, \ldots, P_n)$. As
shown above, such an approximation can be found with less than or equal to $\prod_{i=1}^{n} N_i(\frac{\epsilon}{4})$ elements
where $N_i(\frac{\epsilon}{4}) = N(\frac{\epsilon}{4}, C, P_i)$. Thus, using the proof of Lemma 4 from [4], the algorithm which takes
$\frac{32}{\epsilon} \left(\ln \frac{1}{\delta} + \sum_{i=1}^{n} \ln N_i(\frac{\epsilon}{4})\right)$ samples and outputs an element of the $\frac{\epsilon}{2}$-approximation with the smallest
number of inconsistent samples is a learning algorithm for C with respect to $\mathrm{conv}(P_1, \ldots, P_n)$. □
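The covering construction used in this proof can be illustrated on a finite sample space. The following is a minimal sketch under assumed representations (distributions as dicts, concepts as frozensets); the greedy cover and all function names are hypothetical conveniences, not the construction of [4]. Each concept is assigned an index tuple $(k_1, \ldots, k_n)$ naming one nearby center per distribution, and one representative is kept per nonempty cell, so the number of representatives is at most the product of the individual cover sizes.

```python
from itertools import combinations


def d(p, c1, c2):
    """d_P(c1, c2): probability under p of the symmetric difference."""
    return sum(p.get(x, 0.0) for x in c1 ^ c2)


def all_subsets(domain):
    """Every subset of a finite domain, as frozensets."""
    items = sorted(domain)
    return [frozenset(s) for r in range(len(items) + 1) for s in combinations(items, r)]


def greedy_cover(concepts, p, radius):
    """Centers such that every concept is within `radius` of some center in d_p."""
    centers = []
    for c in concepts:
        if all(d(p, c, center) > radius for center in centers):
            centers.append(c)
    return centers


def joint_cover(concepts, dists, eps):
    """Representatives forming an eps-approximation w.r.t. rho = max_i d_{P_i}."""
    covers = [greedy_cover(concepts, p, eps / 2.0) for p in dists]
    representatives = {}
    for c in concepts:
        # Index tuple (k_1, ..., k_n): for each P_i, a center within eps/2 of c.
        key = tuple(
            next(k for k, center in enumerate(cover) if d(p, c, center) <= eps / 2.0)
            for cover, p in zip(covers, dists)
        )
        # Any single point of a nonempty cell serves as its representative.
        representatives.setdefault(key, c)
    return list(representatives.values()), covers


def covering_radius(concepts, reps, q):
    """Largest distance under q from a concept to its nearest representative."""
    return max(min(d(q, c, r) for r in reps) for c in concepts)
```

For any convex combination `q` of the distributions in `dists`, `covering_radius(concepts, reps, q)` should not exceed `eps`, which is the uniform approximation property the proof relies on.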


The above proposition verifies the conjecture for classes of distributions which are “convex
polyhedra with finitely many sides” in the space of all distributions. In fact, combined with the
previous proposition, the conjecture is verified for all finite unions of such polyhedra.

5. SUMMARY


It  was  first  pointed  out  that  the  condition  for  learnability  with  respect  to  a  fixed  distribution 
obtained  in  [4]  is  identical  to  the  notion  of  finite  metric  entropy.  Metric  entropy  has  been  studied 
elsewhere,  and  perhaps  results  from  that  literature  may  have  applications  to  concept  learning.  In 
considering  relationships  between  the  VC  dimension  of  a  concept  class  and  its  metric  entropy,  we 
extended  a  result  of  [4]  and  stated  an  earlier  result  from  [8].  Finally,  we  proved  some  partial 
results  concerning  learnability  with  respect  to  a  class  of  distributions.  These  results  are  consistent 
with  a  conjecture  in  [4].  Specifically,  it  was  shown  that  the  conjecture  holds  for  any  “sphere” 
of  distributions  and  for  any  set  of  distributions  which  is  a  finite  union  of  “convex  polyhedra  with 
finitely  many  sides.”  In  addition  to  verifying  the  conjecture  in  these  cases,  the  results  indicate  some 
limitations  of  attempting  to  enlarge  the  set  of  learnable  concept  classes  by  requiring  learnability 
only  for  a  class  of  distributions  as  opposed  to  all  distributions. 

In  closing,  we  briefly  mention  some  other  work  that  has  been  done  on  Valiant’s  learning 
framework.  (Note  that  this  is  not  intended  to  be  a  complete  survey.)  A  considerable  amount  of 
work  has  been  done  on  studying  specific  learnable  concept  classes  taking  into  consideration  issues  of 
computational  difficulty.  In  fact,  much  of  [23]  focused  on  certain  special  classes  of  Boolean  functions 
(see  also  [15,20,24]).  Several  papers  have  dealt  with  the  interesting  issue  of  noise  in  the  samples 
[2,14,21,24].  A  result  concerning  noisy  samples  was  also  given  in  [4]  for  the  case  of  learnability  with 
respect to a fixed distribution. Another interesting idea involves the introduction of a measure of the
complexity of concepts, and allowing the number of samples to depend on this complexity. This has
been studied in [6,7,5,10,18]. We stated our definitions of learnability in terms of both the concept
class C and the hypothesis class H, but assumed throughout that $H \supseteq C$. Considerations in the
more general case have been discussed in [1,3,18]. The use of more powerful oracles (i.e., protocols
which allow the learner to get information other than just random samples) has been considered
in [1,23]. Finally, [19] has considered learnability of continuous-valued functions (as opposed to the
usual binary-valued functions).